PCA on Nutrition data

Download the nutrition data from GitHub. It contains several thousand food items with standard nutritional measurements (calories, fat, vitamins, etc.) along with a food group designation. The data originates from the USDA, and note that every item is measured per 100 g, making the observations inherently comparable.

load("~/Downloads/nutrition.rdata")
dim(nutrition)
## [1] 8463   28

There are some missing values in this data. What happens if we simply remove all cases containing an NA…

dim(na.omit(nutrition))
## [1] 2184   28

Seems a bit extreme. Maybe we should check which columns contain a lot of NAs…

colSums(is.na(nutrition))
##          food      calories       protein carbohydrates     total_fat 
##             0          3802             0             1             0 
## saturated_fat      caffiene   cholesterol         fiber    folic_acid 
##           340          3485           362           660          1817 
##        sodium       calcium          iron     magnesium     manganese 
##            83           355           146           726          2109 
##        niacin    phosphorus     potassium    riboflavin      selenium 
##           722           616           457           698          1757 
##       thiamin     vitamin_A    vitamin_B6   vitamin_B12     vitamin_C 
##           720           684           984          1233           778 
##     vitamin_D          zinc         group 
##          3332           744             0

Unfortunately, the NAs are quite spread out among several measures that we probably don’t want removed (calories for example). If we were very focused on this application, we might delve a bit further to find out why there are so many NAs, and whether they should, say, be treated as 0’s instead.
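If we did conclude that some NAs really mean “none present” (plausible for a column like caffiene), zero-filling just those columns would preserve far more rows than na.omit. A minimal sketch on a hypothetical toy data frame (the actual choice of columns would require the domain digging described above):

```r
# Toy stand-in for the nutrition data: NAs in a column where
# "missing" plausibly means "none present"
toy <- data.frame(food = c("a", "b", "c"),
                  caffiene = c(5, NA, NA),
                  calories = c(100, NA, 200))

# Zero-fill only the columns where NA-as-zero is defensible
zero_cols <- c("caffiene")
toy[zero_cols] <- lapply(toy[zero_cols], function(v) ifelse(is.na(v), 0, v))

# Rows with NAs in other columns (e.g. calories) are still dropped
toy <- na.omit(toy)
```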

nutrition <- na.omit(nutrition)

Always a good idea to explore the data a bit further…

summary(nutrition)
##      food              calories        protein      carbohydrates    
##  Length:2184        Min.   :  0.0   Min.   : 0.00   Min.   : 0.0000  
##  Class :character   1st Qu.: 65.0   1st Qu.: 1.74   1st Qu.: 0.3375  
##  Mode  :character   Median :171.0   Median : 7.49   Median : 8.5250  
##                     Mean   :201.1   Mean   :11.48   Mean   :20.3643  
##                     3rd Qu.:308.0   3rd Qu.:20.70   3rd Qu.:27.2475  
##                     Max.   :876.0   Max.   :88.32   Max.   :99.7700  
##    total_fat      saturated_fat         caffiene         cholesterol     
##  Min.   : 0.000   Min.   : 0.00000   Min.   :   0.000   Min.   :   0.00  
##  1st Qu.: 0.340   1st Qu.: 0.06075   1st Qu.:   0.000   1st Qu.:   0.00  
##  Median : 3.060   Median : 0.78800   Median :   0.000   Median :   0.00  
##  Mean   : 8.713   Mean   : 2.90595   Mean   :   6.731   Mean   :  39.66  
##  3rd Qu.:12.383   3rd Qu.: 3.61650   3rd Qu.:   0.000   3rd Qu.:  67.00  
##  Max.   :99.480   Max.   :61.92400   Max.   :5714.000   Max.   :3100.00  
##      fiber          folic_acid          sodium           calcium       
##  Min.   : 0.000   Min.   :  0.000   Min.   :    0.0   Min.   :   0.00  
##  1st Qu.: 0.000   1st Qu.:  0.000   1st Qu.:   15.0   1st Qu.:  11.00  
##  Median : 0.800   Median :  0.000   Median :   71.0   Median :  22.00  
##  Mean   : 2.505   Mean   :  6.636   Mean   :  320.4   Mean   :  88.09  
##  3rd Qu.: 2.800   3rd Qu.:  0.000   3rd Qu.:  325.0   3rd Qu.:  61.00  
##  Max.   :79.000   Max.   :273.000   Max.   :26000.0   Max.   :7364.00  
##       iron           magnesium        manganese            niacin      
##  Min.   :  0.000   Min.   :  0.00   Min.   :  0.0000   Min.   : 0.000  
##  1st Qu.:  0.440   1st Qu.: 12.00   1st Qu.:  0.0180   1st Qu.: 0.400  
##  Median :  1.020   Median : 21.00   Median :  0.1080   Median : 1.432  
##  Mean   :  2.178   Mean   : 40.61   Mean   :  0.6526   Mean   : 2.978  
##  3rd Qu.:  2.320   3rd Qu.: 32.00   3rd Qu.:  0.3723   3rd Qu.: 4.769  
##  Max.   :123.600   Max.   :781.00   Max.   :133.0000   Max.   :80.000  
##    phosphorus       potassium         riboflavin        selenium      
##  Min.   :   0.0   Min.   :    0.0   Min.   :0.0000   Min.   :   0.00  
##  1st Qu.:  37.0   1st Qu.:  135.0   1st Qu.:0.0470   1st Qu.:   0.70  
##  Median : 125.0   Median :  233.5   Median :0.1385   Median :   6.30  
##  Mean   : 175.8   Mean   :  342.7   Mean   :0.2126   Mean   :  15.22  
##  3rd Qu.: 224.0   3rd Qu.:  356.0   3rd Qu.:0.2500   3rd Qu.:  24.00  
##  Max.   :9918.0   Max.   :16500.0   Max.   :6.8000   Max.   :1917.00  
##     thiamin          vitamin_A         vitamin_B6      vitamin_B12     
##  Min.   : 0.0000   Min.   :    0.0   Min.   :0.0000   Min.   : 0.0000  
##  1st Qu.: 0.0320   1st Qu.:    0.0   1st Qu.:0.0490   1st Qu.: 0.0000  
##  Median : 0.0700   Median :   22.0   Median :0.1290   Median : 0.0000  
##  Mean   : 0.1855   Mean   :  847.5   Mean   :0.2395   Mean   : 0.8791  
##  3rd Qu.: 0.2000   3rd Qu.:  214.0   3rd Qu.:0.3400   3rd Qu.: 0.7225  
##  Max.   :10.9900   Max.   :77261.0   Max.   :8.0000   Max.   :83.1300  
##    vitamin_C         vitamin_D            zinc           group          
##  Min.   :   0.00   Min.   :   0.00   Min.   : 0.000   Length:2184       
##  1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.: 0.270   Class :character  
##  Median :   0.15   Median :   0.00   Median : 0.770   Mode  :character  
##  Mean   :  10.45   Mean   :  19.54   Mean   : 1.721                     
##  3rd Qu.:   4.10   3rd Qu.:   5.00   3rd Qu.: 2.410                     
##  Max.   :2400.00   Max.   :1123.00   Max.   :90.950

Even though the food amounts were standardized, the variables are clearly measured on vastly different scales. As such, we should be sure to scale the variables in any PCA we perform.
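As a quick sanity check on the scaling point, prcomp(x, scale.=TRUE) is equivalent to standardizing the columns first, and skipping the scaling lets a large-variance column dominate the first component. A sketch on simulated data:

```r
set.seed(1)
X <- cbind(rnorm(100),            # unit-scale variable
           rnorm(100, sd = 1000)) # wildly larger scale

p_scaled   <- prcomp(X, scale. = TRUE)  # scale inside prcomp
p_manual   <- prcomp(scale(X))          # standardize first, then PCA
p_unscaled <- prcomp(X)                 # no scaling: the big column dominates PC1
```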

nupca <- prcomp(nutrition[,-c(1,28)], scale.=TRUE)
summary(nupca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.1627 1.6632 1.56431 1.32432 1.27038 1.17873 1.14213
## Proportion of Variance 0.1799 0.1064 0.09412 0.06745 0.06207 0.05344 0.05017
## Cumulative Proportion  0.1799 0.2863 0.38041 0.44786 0.50993 0.56337 0.61354
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.07919 0.98092 0.95248 0.93091 0.86575 0.83505 0.81022
## Proportion of Variance 0.04479 0.03701 0.03489 0.03333 0.02883 0.02682 0.02525
## Cumulative Proportion  0.65834 0.69535 0.73024 0.76357 0.79240 0.81922 0.84446
##                           PC15    PC16    PC17   PC18    PC19    PC20   PC21
## Standard deviation     0.77617 0.76276 0.71123 0.6511 0.62883 0.60465 0.5679
## Proportion of Variance 0.02317 0.02238 0.01946 0.0163 0.01521 0.01406 0.0124
## Cumulative Proportion  0.86764 0.89001 0.90947 0.9258 0.94098 0.95504 0.9675
##                           PC22    PC23    PC24    PC25    PC26
## Standard deviation     0.55599 0.46560 0.41915 0.37748 0.04831
## Proportion of Variance 0.01189 0.00834 0.00676 0.00548 0.00009
## Cumulative Proportion  0.97934 0.98767 0.99443 0.99991 1.00000
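Before eyeballing retention criteria, note that the usual counts (Kaiser’s eigenvalue-greater-than-1 rule, or a cumulative-variance target) can be computed directly from the sdev slot. A sketch on simulated data; for the fit above you would substitute nupca for pca:

```r
set.seed(2)
pca <- prcomp(matrix(rnorm(200 * 10), 200, 10), scale. = TRUE)

# Kaiser criterion: keep components with eigenvalue > 1 (i.e. sdev^2 > 1)
kaiser_k <- sum(pca$sdev^2 > 1)

# Smallest number of components reaching 90% cumulative variance
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k90 <- which(cumvar >= 0.90)[1]
```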

The Kaiser criterion would suggest retaining 8 components, but that captures only about 66% of the variability in the data. If we wanted to retain 90% of the variation, we would need to keep 17 components. We can also check a scree plot

plot(nupca, type="lines")

but by default the x-axis doesn’t span all the components, so let’s try expanding it

plot(nupca, type="lines", xlim=c(1,26))

Hmm, that didn’t work. A little searching shows that screeplot is the default plot method for a prcomp object, and the argument we want to increase is actually npcs…

plot(nupca, type="lines", npcs=26)

After all that work, there’s no clear picture of how many components to retain using a scree plot. Best argument might be for 3 components, but you’d be leaving a lot of variance on the table.

In any case, let’s take a look at the first three components

nupca$rotation[,1:3]
##                       PC1          PC2          PC3
## calories      -0.28984450  0.071740483 -0.402817933
## protein       -0.25221736  0.304280692  0.035072126
## carbohydrates -0.13433393 -0.358147046 -0.173526769
## total_fat     -0.19967152  0.248340853 -0.410578696
## saturated_fat -0.13209879  0.269016073 -0.389903929
## caffiene      -0.06873300 -0.137864557  0.010768664
## cholesterol   -0.08102362  0.325268544  0.043743444
## fiber         -0.21811173 -0.339734033 -0.055905551
## folic_acid    -0.03832779 -0.104949921 -0.112380806
## sodium        -0.04112324  0.017210793 -0.081622954
## calcium       -0.19496728 -0.125007935 -0.085652175
## iron          -0.25405163 -0.195183474 -0.002385225
## magnesium     -0.31744573 -0.217159458 -0.081421582
## manganese     -0.13249482 -0.218888570 -0.018023030
## niacin        -0.29089205  0.136000014  0.255038615
## phosphorus    -0.23840742  0.004174599 -0.090088264
## potassium     -0.24781734 -0.212430668  0.066966641
## riboflavin    -0.29358866  0.098618291  0.261987058
## selenium      -0.11671311  0.168235553 -0.029249888
## thiamin       -0.22754623 -0.008655121  0.070734365
## vitamin_A     -0.07072538 -0.069177326  0.251779136
## vitamin_B6    -0.28627061  0.060930755  0.319147647
## vitamin_B12   -0.09624647  0.258476489  0.202930449
## vitamin_C     -0.11195416 -0.092153385  0.278167093
## vitamin_D     -0.06235645  0.173711766  0.105342221
## zinc          -0.16433946  0.154609737  0.003061607

To make it more readable, let’s do some rounding

round(nupca$rotation[,1:3], 2)
##                 PC1   PC2   PC3
## calories      -0.29  0.07 -0.40
## protein       -0.25  0.30  0.04
## carbohydrates -0.13 -0.36 -0.17
## total_fat     -0.20  0.25 -0.41
## saturated_fat -0.13  0.27 -0.39
## caffiene      -0.07 -0.14  0.01
## cholesterol   -0.08  0.33  0.04
## fiber         -0.22 -0.34 -0.06
## folic_acid    -0.04 -0.10 -0.11
## sodium        -0.04  0.02 -0.08
## calcium       -0.19 -0.13 -0.09
## iron          -0.25 -0.20  0.00
## magnesium     -0.32 -0.22 -0.08
## manganese     -0.13 -0.22 -0.02
## niacin        -0.29  0.14  0.26
## phosphorus    -0.24  0.00 -0.09
## potassium     -0.25 -0.21  0.07
## riboflavin    -0.29  0.10  0.26
## selenium      -0.12  0.17 -0.03
## thiamin       -0.23 -0.01  0.07
## vitamin_A     -0.07 -0.07  0.25
## vitamin_B6    -0.29  0.06  0.32
## vitamin_B12   -0.10  0.26  0.20
## vitamin_C     -0.11 -0.09  0.28
## vitamin_D     -0.06  0.17  0.11
## zinc          -0.16  0.15  0.00

Let’s make it even more readable. Since we have lots of values here, it is often easier to view just the variables that surpass some threshold of absolute magnitude. For example…

loadi <- round(nupca$rotation[,1:3], 2)
loadi[abs(loadi)<0.2] <- NA
loadi
##                 PC1   PC2   PC3
## calories      -0.29    NA -0.40
## protein       -0.25  0.30    NA
## carbohydrates    NA -0.36    NA
## total_fat     -0.20  0.25 -0.41
## saturated_fat    NA  0.27 -0.39
## caffiene         NA    NA    NA
## cholesterol      NA  0.33    NA
## fiber         -0.22 -0.34    NA
## folic_acid       NA    NA    NA
## sodium           NA    NA    NA
## calcium          NA    NA    NA
## iron          -0.25 -0.20    NA
## magnesium     -0.32 -0.22    NA
## manganese        NA -0.22    NA
## niacin        -0.29    NA  0.26
## phosphorus    -0.24    NA    NA
## potassium     -0.25 -0.21    NA
## riboflavin    -0.29    NA  0.26
## selenium         NA    NA    NA
## thiamin       -0.23    NA    NA
## vitamin_A        NA    NA  0.25
## vitamin_B6    -0.29    NA  0.32
## vitamin_B12      NA  0.26  0.20
## vitamin_C        NA    NA  0.28
## vitamin_D        NA    NA    NA
## zinc             NA    NA    NA

Let’s start with the first component. This component has modest loadings on nearly all of the variables. Interestingly, all of the loadings (including those below the threshold) are negative; that is, every variable in the data set loads in the same direction. Since the sign of a component is arbitrary (flipping it doesn’t change the underlying mathematics), let’s multiply the first component by negative 1.

loadi[,1] <- -loadi[,1]
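To convince ourselves this flip is harmless, note that negating both a component’s loadings and its scores leaves the reconstruction of the (standardized) data untouched. A quick check on simulated data:

```r
set.seed(3)
X <- matrix(rnorm(50 * 4), 50, 4)
p <- prcomp(X, scale. = TRUE)

# Flip the sign of PC1 in both the loadings and the scores
rot_flip <- p$rotation; rot_flip[, 1] <- -rot_flip[, 1]
scr_flip <- p$x;        scr_flip[, 1] <- -scr_flip[, 1]

# Both versions reconstruct the same standardized data
recon_orig <- p$x %*% t(p$rotation)
recon_flip <- scr_flip %*% t(rot_flip)
```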

Now we can reasonably interpret this component as something like “nutrient density”. Any food item scoring high should be nutrient rich, and anything scoring low should be relatively devoid of nutrients (like water). Let’s see what the max and min food items are on this…and keep in mind that we still need to flip the sign of the scores, since they were computed on the original loadings…

head(nutrition[order(-nupca$x[,1], decreasing=TRUE),1])
## [1] "Fruit-flavored drink, powder, with high vitamin C with other added vitamins, low calorie"
## [2] "Leavening agents, yeast, baker's, active dry"                                            
## [3] "Malted drink mix, chocolate, with added nutrients, powder"                               
## [4] "Rice bran, crude"                                                                        
## [5] "Spices, basil, dried"                                                                    
## [6] "Malted drink mix, natural, with added nutrients, powder"
tail(nutrition[order(-nupca$x[,1], decreasing=TRUE),1])
## [1] "Carbonated beverage, low calorie, cola or pepper-type, with aspartame, without caffeine"       
## [2] "Tea, herb, chamomile, brewed"                                                                  
## [3] "Tea, herb, other than chamomile, brewed"                                                       
## [4] "Carbonated beverage, low calorie, other than cola or pepper, with aspartame, contains caffeine"
## [5] "Carbonated beverage, club soda"                                                                
## [6] "Carbonated beverage, low calorie, other than cola or pepper,  without caffeine"

At first blush this appears a bit unintuitive, as most of the entries on both lists are drinks. BUT READ CLOSELY! The first, third, and sixth most nutrient “dense” items are actually POWDERED drink mixes. So 100g of those probably makes a substantial amount of drink, is likely packed with sugar (hence very high in calories), and in some cases the description even mentions added vitamins. In fact, all of the top 6 are concentrated/dried items. The bottom six, on the other hand, are liquid drinks specifically devoid of most nutrients…diet pop, sparkling water, and herbal tea.

Let’s move on to the second component…

loadi
##                PC1   PC2   PC3
## calories      0.29    NA -0.40
## protein       0.25  0.30    NA
## carbohydrates   NA -0.36    NA
## total_fat     0.20  0.25 -0.41
## saturated_fat   NA  0.27 -0.39
## caffiene        NA    NA    NA
## cholesterol     NA  0.33    NA
## fiber         0.22 -0.34    NA
## folic_acid      NA    NA    NA
## sodium          NA    NA    NA
## calcium         NA    NA    NA
## iron          0.25 -0.20    NA
## magnesium     0.32 -0.22    NA
## manganese       NA -0.22    NA
## niacin        0.29    NA  0.26
## phosphorus    0.24    NA    NA
## potassium     0.25 -0.21    NA
## riboflavin    0.29    NA  0.26
## selenium        NA    NA    NA
## thiamin       0.23    NA    NA
## vitamin_A       NA    NA  0.25
## vitamin_B6    0.29    NA  0.32
## vitamin_B12     NA  0.26  0.20
## vitamin_C       NA    NA  0.28
## vitamin_D       NA    NA    NA
## zinc            NA    NA    NA

Any item scoring high on PC2 will have high protein/fat/cholesterol/vitaminB, and low carbs/fiber/iron/magnesium/manganese/potassium. For the most part, it seems like a measure of “meatiness”. Let’s look at the top 6 and bottom 6 on this

head(nutrition[order(nupca$x[,2], decreasing=TRUE),1])
## [1] "Egg, yolk, dried"                                             
## [2] "Beef, variety meats and by-products, brain, cooked, simmered" 
## [3] "Beef, variety meats and by-products, liver, cooked, pan-fried"
## [4] "Egg, whole, dried"                                            
## [5] "Beef, variety meats and by-products, liver, cooked, braised"  
## [6] "Nuts, brazilnuts, dried, unblanched"
tail(nutrition[order(nupca$x[,2], decreasing=TRUE),1])
## [1] "Spices, cloves, ground"                          
## [2] "Spices, marjoram, dried"                         
## [3] "Spices, thyme, dried"                            
## [4] "Spices, basil, dried"                            
## [5] "Tea, instant, unsweetened, powder, decaffeinated"
## [6] "Tea, instant, unsweetened, powder"

For the most part our interpretation was correct (though I forgot how protein-rich and carb-less nuts were). At least one can’t get much further away from meat than unsweetened tea powder!

Finally, let’s look at component 3

loadi
##                PC1   PC2   PC3
## calories      0.29    NA -0.40
## protein       0.25  0.30    NA
## carbohydrates   NA -0.36    NA
## total_fat     0.20  0.25 -0.41
## saturated_fat   NA  0.27 -0.39
## caffiene        NA    NA    NA
## cholesterol     NA  0.33    NA
## fiber         0.22 -0.34    NA
## folic_acid      NA    NA    NA
## sodium          NA    NA    NA
## calcium         NA    NA    NA
## iron          0.25 -0.20    NA
## magnesium     0.32 -0.22    NA
## manganese       NA -0.22    NA
## niacin        0.29    NA  0.26
## phosphorus    0.24    NA    NA
## potassium     0.25 -0.21    NA
## riboflavin    0.29    NA  0.26
## selenium        NA    NA    NA
## thiamin       0.23    NA    NA
## vitamin_A       NA    NA  0.25
## vitamin_B6    0.29    NA  0.32
## vitamin_B12     NA  0.26  0.20
## vitamin_C       NA    NA  0.28
## vitamin_D       NA    NA    NA
## zinc            NA    NA    NA

Anything scoring high on this component will be low in calories and fat, but high in vitamins. So generally speaking, let’s just call it “vitamin-dense”.

head(nutrition[order(nupca$x[,3], decreasing=TRUE),1])
## [1] "Fruit-flavored drink, powder, with high vitamin C with other added vitamins, low calorie"
## [2] "Peppers, sweet, red, freeze-dried"                                                       
## [3] "Beef, variety meats and by-products, liver, cooked, pan-fried"                           
## [4] "Beef, variety meats and by-products, liver, cooked, braised"                             
## [5] "Malted drink mix, chocolate, with added nutrients, powder"                               
## [6] "Malted drink mix, natural, with added nutrients, powder"
tail(nutrition[order(nupca$x[,3], decreasing=TRUE),1])
## [1] "Pork, fresh, backfat, raw"                            
## [2] "Butter, whipped, with salt"                           
## [3] "Butter, without salt"                                 
## [4] "Butter, salted"                                       
## [5] "Nuts, coconut meat, dried (desiccated), not sweetened"
## [6] "Butter oil, anhydrous"

The bottom six are no-brainers (basically just fat-based items), but the top six are again a bit strange. We have, once more, our powdered drink mixes, freeze-dried peppers, and beef liver!

PCAReg on Nutrition data

Let’s continue with the nutrition data. We’ll keep the same lazy cleaning again just to keep things easy.

#load("~/Downloads/nutrition.rdata")
#dim(nutrition)
#nutrition <- na.omit(nutrition)
colnames(nutrition)
##  [1] "food"          "calories"      "protein"       "carbohydrates"
##  [5] "total_fat"     "saturated_fat" "caffiene"      "cholesterol"  
##  [9] "fiber"         "folic_acid"    "sodium"        "calcium"      
## [13] "iron"          "magnesium"     "manganese"     "niacin"       
## [17] "phosphorus"    "potassium"     "riboflavin"    "selenium"     
## [21] "thiamin"       "vitamin_A"     "vitamin_B6"    "vitamin_B12"  
## [25] "vitamin_C"     "vitamin_D"     "zinc"          "group"

Now then, to set this up in a PCRegression context, we need a natural continuous response variable. Calories is likely the most natural choice, since they derive primarily from both fat and carbs. As such, we will need to remove calories from the PCA call.

nupca <- prcomp(nutrition[,-c(1,2,28)], scale.=TRUE)
summary(nupca)
## Importance of components:
##                           PC1    PC2     PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.0905 1.6613 1.45197 1.2787 1.24728 1.17387 1.13369
## Proportion of Variance 0.1748 0.1104 0.08433 0.0654 0.06223 0.05512 0.05141
## Cumulative Proportion  0.1748 0.2852 0.36953 0.4349 0.49715 0.55227 0.60368
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.07393 0.98081 0.95224 0.92905 0.85969 0.83157 0.78987
## Proportion of Variance 0.04613 0.03848 0.03627 0.03453 0.02956 0.02766 0.02496
## Cumulative Proportion  0.64982 0.68830 0.72457 0.75909 0.78865 0.81631 0.84127
##                           PC15    PC16    PC17    PC18    PC19    PC20   PC21
## Standard deviation     0.77604 0.76121 0.70758 0.64976 0.62872 0.59025 0.5678
## Proportion of Variance 0.02409 0.02318 0.02003 0.01689 0.01581 0.01394 0.0129
## Cumulative Proportion  0.86536 0.88854 0.90856 0.92545 0.94126 0.95520 0.9681
##                           PC22    PC23    PC24    PC25
## Standard deviation     0.53484 0.46515 0.40082 0.36685
## Proportion of Variance 0.01144 0.00865 0.00643 0.00538
## Cumulative Proportion  0.97954 0.98819 0.99462 1.00000

Again, the Kaiser criterion would suggest retaining 8 components, but that captures only about 65% of the variability in the data. In the previous lab we investigated the other standard approaches for component retention…let’s not bother for this lab. We’ll retain the first 8 components and pop the scores of the original observations on those components into an object for future use.

scr <- nupca$x[,1:8]

Now we can run PCReg for calories as a response…

pcregmod <- lm(nutrition$calories ~ scr)
summary(pcregmod)
## 
## Call:
## lm(formula = nutrition$calories ~ scr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -624.91  -40.83  -10.22   29.49  236.46 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 201.1200     1.5068 133.475  < 2e-16 ***
## scrPC1       35.9805     0.7210  49.907  < 2e-16 ***
## scrPC2       -5.0377     0.9072  -5.553 3.15e-08 ***
## scrPC3       51.3592     1.0380  49.479  < 2e-16 ***
## scrPC4       26.6879     1.1787  22.642  < 2e-16 ***
## scrPC5       55.3065     1.2083  45.770  < 2e-16 ***
## scrPC6       20.3169     1.2839  15.824  < 2e-16 ***
## scrPC7      -28.3220     1.3294 -21.304  < 2e-16 ***
## scrPC8       22.3124     1.4034  15.899  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.42 on 2175 degrees of freedom
## Multiple R-squared:  0.7969, Adjusted R-squared:  0.7962 
## F-statistic:  1067 on 8 and 2175 DF,  p-value: < 2.2e-16

Cool! Everything appears interesting and useful for modelling. Of course, our PCA was done unsupervised — therefore, without respect to what we were trying to model (calories). So perhaps PLS can help us improve…
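Before moving on, one PCReg detail worth sketching: because the scores are linear combinations of the standardized predictors, the fitted component coefficients can always be mapped back to coefficients on the standardized original variables through the rotation matrix. A sketch on simulated data (the names here are illustrative, not from the nutrition fit):

```r
set.seed(4)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X[, 1] + rnorm(100)

p   <- prcomp(X, scale. = TRUE)
scr <- p$x[, 1:3]               # keep the first 3 components
fit <- lm(y ~ scr)

# Map component coefficients back to the standardized original variables
beta_std <- p$rotation[, 1:3] %*% coef(fit)[-1]

# Predicting from standardized X with beta_std matches the PCReg fit
yhat_pc  <- fitted(fit)
yhat_std <- drop(scale(X) %*% beta_std) + coef(fit)[1]
```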

PLS on nutrition

We will need the pls library here

#install.packages("pls")
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings

Now, we will feed the function plsr a formula for calories as a response to all other numeric predictors.

nuplsmod <- plsr(calories~., data=nutrition[,-c(1,28)], scale=TRUE, method="oscorespls")
summary(nuplsmod)
## Data:    X dimension: 2184 25 
##  Y dimension: 2184 1
## Fit method: oscorespls
## Number of components considered: 25
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           14.53    24.29    29.77    34.48    42.06    47.18    51.29
## calories    69.29    89.08    94.64    97.56    98.41    99.10    99.35
##           8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X           53.93    58.15     62.05     65.75     68.60     71.63     74.91
## calories    99.48    99.51     99.51     99.51     99.52     99.52     99.52
##           15 comps  16 comps  17 comps  18 comps  19 comps  20 comps  21 comps
## X            77.47     79.78     82.71     84.67     86.75     89.30     92.12
## calories     99.52     99.52     99.52     99.52     99.52     99.52     99.52
##           22 comps  23 comps  24 comps  25 comps
## X            94.66     96.10     97.89    100.00
## calories     99.52     99.52     99.52     99.52

This gives a very different picture…namely, it appears that 2 components are probably sufficient (cumulatively they explain just over 89% of the variation in calories). Compare that to PCReg, where 8 components explained only about 80% of the variation in calories. Clearly an improvement. In fact, if we take only the first two components for PCReg…

pcregmod2 <- lm(nutrition$calories ~ scr[,1:2])
summary(pcregmod2)
## 
## Call:
## lm(formula = nutrition$calories ~ scr[, 1:2])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -887.77  -92.51  -45.45   74.04  646.03 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    201.120      2.920  68.888   <2e-16 ***
## scr[, 1:2]PC1   35.981      1.397  25.758   <2e-16 ***
## scr[, 1:2]PC2   -5.038      1.758  -2.866   0.0042 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 136.4 on 2181 degrees of freedom
## Multiple R-squared:  0.2355, Adjusted R-squared:  0.2348 
## F-statistic: 335.8 on 2 and 2181 DF,  p-value: < 2.2e-16

Only about 24%! Hence why PLS is generally the superior approach in a supervised context.
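The gap is easy to manufacture: PCA chooses high-variance directions with no regard for the response, so if y happens to live along a low-variance direction, the leading components are useless for regression. A contrived base-R illustration (toy data, not the PLS algorithm itself):

```r
set.seed(5)
x1 <- rnorm(500, sd = 10)  # high variance, unrelated to the response
x2 <- rnorm(500)           # low variance, drives the response
y  <- 3 * x2 + rnorm(500, sd = 0.1)

p <- prcomp(cbind(x1, x2))  # deliberately unscaled to exaggerate the effect
r2_pc1 <- summary(lm(y ~ p$x[, 1]))$r.squared  # PC1 is essentially x1: useless
r2_pc2 <- summary(lm(y ~ p$x[, 2]))$r.squared  # PC2 is essentially x2: excellent
```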

Caution

But how does even PLS compare to a straightforward model built from domain knowledge…

dwmod <- lm(calories~total_fat+carbohydrates, data=nutrition)
summary(dwmod)
## 
## Call:
## lm(formula = calories ~ total_fat + carbohydrates, data = nutrition)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -151.58  -40.60  -10.91   36.02  310.91 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   49.95804    1.47220   33.93   <2e-16 ***
## total_fat      9.63835    0.07792  123.69   <2e-16 ***
## carbohydrates  3.29893    0.03917   84.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.3 on 2181 degrees of freedom
## Multiple R-squared:  0.9081, Adjusted R-squared:  0.908 
## F-statistic: 1.078e+04 on 2 and 2181 DF,  p-value: < 2.2e-16

Slightly more of the variation in calories (90.8% versus 89.1%) can be explained using an equivalent dimensionality from the original variables.

In machine learning, there is often no substitute for domain-knowledge!

Single linkage chaining

Let’s simulate a weird data set with 2 clusters and a line of points running between them. We’ll use Manhattan distance for no particular reason other than to show how to specify it…

library(MASS)
set.seed(3173)
datagen1 <- mvrnorm(25, c(0,0), matrix(c(1,0,0,1),2,2))
datagen2 <- mvrnorm(25, c(5,-5), matrix(c(1,0,0,1),2,2))
datagen <- rbind(datagen1, datagen2)
x1 <- seq(from=0, to=5, by=0.2)
x2 <- seq(from=0, to=-5, by=-0.2)
newvalx <- cbind(x1,x2)
datagen <- rbind(datagen,newvalx)
plot(datagen)

mandist <- dist(datagen,method="manhattan")
clus1 <- hclust(mandist, method="single")
plot(clus1)

Using single linkage, we have major difficulty finding an underlying group structure. The plot above is a tell-tale sign of the phenomenon called “chaining”: essentially, one group keeps absorbing observations one at a time for the majority of the algorithm. This problem tends to be a concern only with single linkage…
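A minimal way to see chaining in isolation: for equally spaced points on a line, single linkage merges everything at the same small height (one long chain), while complete linkage’s merge heights keep growing:

```r
pts <- matrix(1:10, ncol = 1)  # 10 equally spaced points on a line
d   <- dist(pts)

hs <- hclust(d, method = "single")    # chains: every merge happens at height 1
hc <- hclust(d, method = "complete")  # merge heights keep growing

hs$height
hc$height
```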

clus2 <- hclust(mandist, method="average")
plot(clus2)

Note that switching to Euclidean distance doesn’t fix the chaining issue, but it does result in slightly different groups

eucdist <- dist(datagen,method="euclidean")
clus3 <- hclust(eucdist, method="single")
plot(clus3)

#Complete linkage (still on Manhattan distance)
clus4 <- hclust(mandist, method="complete")
plot(clus4)

Here we specify where to cut the tree by requesting a specific number of groups, and then we can compare the results of the average- and complete-linkage clusterings (both on Manhattan distance)…

mancom <- cutree(clus2,2)
euccom <- cutree(clus4,2)
table(mancom,euccom)
##       euccom
## mancom  1  2
##      1 39  0
##      2  0 37

No observations are grouped differently. Here’s a plot of the average-linkage groups on Manhattan distance.

plot(datagen, col=mancom)

Now Euclidean distance with single linkage, which we will compare against the complete-linkage result above.

eucsin <- cutree(clus3,2)
table(euccom,eucsin)
##       eucsin
## euccom  1  2
##      1 39  0
##      2 35  2

Single linkage makes two groups, one with only two observations in it!

Kmeans

Let’s code up k-means from scratch. We should be able to provide the function with the data x and the number of groups k.

It’s probably best to remind you of the steps of the k-means algorithm first:

1. Start with k random centroids (let’s use points within the data).
2. Assign all observations to their closest centroid (by Euclidean distance).
3. Recalculate the group means; call these your new centroids.
4. Repeat steps 2 and 3 until nothing changes.

There are several ways of going about this. Sometimes it’s easier to just code the actual steps and then back up to set up the loop, etc., but I’ll provide the whole thing, with some comments pointing out what each chunk of the code accomplishes.

my_k_means <- function(x, k){
  #1 start with k centroids
  centrs <- x[sample(1:nrow(x), k),]
  #start loop
  changing <- TRUE  
  while(changing){
    #2a) calculate distances between all x and centrs
    dists <- matrix(NA, nrow(x), k)
    #could write as a double loop, or could utilize other built in functions probably
    for(i in 1:nrow(x)){
      for(j in 1:k){
        dists[i, j] <- sqrt(sum((x[i,] - centrs[j,])^2)) 
      }
    }
    #2b) assign group memberships (which.min over each row of dists picks the nearest centroid)
    membs <- apply(dists, 1, which.min)
    
    #3) calculate new group centroids
    oldcentrs <- centrs #save for convergence check!
    for(j in 1:k){
      centrs[j,] <- colMeans(x[membs==j, , drop=FALSE]) #drop=FALSE keeps singleton clusters as matrices
    }
    
    #4) check for convergence
    if(all(centrs==oldcentrs)){
      changing <- FALSE
    }
  }
  #output memberships
  membs
}

set.seed(5314)
#debug(my_k_means) #if you want
test <- my_k_means(scale(faithful), 2)
plot(faithful, col=test)

Compare to the built in version…

test2 <- kmeans(scale(faithful), 2)
plot(faithful, col=test2$cluster)
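One caution when comparing our results to the built-in version: cluster labels are arbitrary (group 1 in one run may be group 2 in another), so compare solutions with a cross-tabulation rather than element-wise equality. A sketch on two well-separated blobs:

```r
set.seed(6)
blobs <- rbind(matrix(rnorm(40), ncol = 2),
               matrix(rnorm(40, mean = 10), ncol = 2))
truth <- rep(1:2, each = 20)

km <- kmeans(blobs, 2, nstart = 5)$cluster

# Cross-tabulate: perfect recovery shows up as one nonzero cell per row
tab   <- table(km, truth)
agree <- sum(apply(tab, 1, max)) / length(truth)
```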

Mouse simulation

We’ll talk more about this in the next week or so, but here’s a very simple way to see some of the hidden assumptions of k-means in action.

library(mvtnorm)
set.seed(35151)
le <- rmvnorm(400, mean = c(-5,7.5))
re <- rmvnorm(400, mean = c(5,7.5))
hd <- rmvnorm(400, mean = c(0,0), sigma=7*diag(2) )
dat <- rbind(le, re, hd)
plot(dat)

mickres <- my_k_means(scale(dat), 3)
plot(dat, col=mickres)

Note how k-means makes the smaller mouse “ears” larger than they really are, even though the groups are relatively well separated. We can see this in a side-by-side of the truth and the k-means solution

par(mfrow=c(1,2))
plot(dat, col=mickres)
plot(dat, col=rep(c(1,3,2), each=400))

What about hierarchical?

mickdist <- dist(scale(dat))
mickhc <- hclust(mickdist, method="average")
mickhcres <- cutree(mickhc, 3)
plot(dat, col=mickhcres)

Still not perfect, but not as bad. Some of the other linkage options are worse than what results from k-means as well.

Performance metrics

Now that we have some performance metrics and a few models under our belt, let’s pit some classification models against each other on the body data set.

library(gclus)
## Loading required package: cluster
data(body)
body$Gender <- factor(body$Gender)

First up, random forests!

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
set.seed(4521)
bodyRF <- randomForest(Gender~., data=body)
bodyRF
## 
## Call:
##  randomForest(formula = Gender ~ ., data = body) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 3.75%
## Confusion matrix:
##     0   1 class.error
## 0 250  10  0.03846154
## 1   9 238  0.03643725
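As a sanity check, the printed OOB error rate can be recomputed by hand from the confusion matrix; a base-R sketch with the counts copied in from the printout:

```r
# Recompute the OOB misclassification rate from the printed
# confusion matrix (counts copied from the randomForest output)
cm <- matrix(c(250,  10,
                 9, 238), nrow = 2, byrow = TRUE)
1 - sum(diag(cm)) / sum(cm)   # off-diagonal counts are the errors: 19/507, about 0.0375
```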

We can see from the output several measures computed on the OOB observations. The misclassification rate is (10+9)/507 ≈ 0.0375. Now, given the lack of information in the printout (and the help file), we may want to verify whether the rows or the columns of the confusion matrix are the ground truth! Easy…

table(body$Gender)
## 
##   0   1 
## 260 247

So the response variable has 247 ones (males). Since the second row of the printed confusion matrix sums to 247, we can deduce that the rows are the truth and the columns the predictions. We can use the MLmetrics library to compute some of the measures we’ve discussed.

library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall

Note that the LogLoss function is specifically for binary classification…there is also MultiLogLoss from the same library for multiclass scenarios.

Judging from the help files (fairly sparse), my best guess for the binary version is that we feed it the predicted probability of class 1 as the prediction vector…

LogLoss(predict(bodyRF, type="prob")[,2], body$Gender)
## Warning in Ops.factor(y_true, log(y_pred)): '*' not meaningful for factors
## Warning in Ops.factor(1, y_true): '-' not meaningful for factors
## [1] NA

Here’s an aggravation. It looks like the function wants the true 0-1 classification vector as numeric for computation purposes. Here’s a quick fix…

LogLoss(predict(bodyRF, type="prob")[,2], as.numeric(body$Gender)-1)
## [1] 0.1141382

It’s worth pointing out that all of these measures are believable long-run estimates for those metrics since they are computed on the OOB classifications/probabilities.
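For reference, here's a minimal hand-rolled sketch of what a binary LogLoss computes (illustrative only, not MLmetrics' actual implementation): the mean of -(y*log(p) + (1-y)*log(1-p)) over all observations, where p is the predicted probability of class 1 and y is the 0-1 truth.

```r
# Hand-rolled binary log loss (illustrative, not the MLmetrics code):
# p = predicted probability of class 1, y = true 0-1 labels
logloss <- function(p, y) mean(-(y * log(p) + (1 - y) * log(1 - p)))
logloss(c(0.9, 0.8, 0.7), c(1, 1, 1))  # confident and correct: about 0.23
logloss(c(0.1, 0.2, 0.3), c(1, 1, 1))  # confident and wrong: much larger
```

Misclassifications made with high confidence dominate the average, which is exactly why LogLoss separates models that have similar misclassification counts.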

Now LDA; note the CV=TRUE argument. The help file suggests that this will provide classifications and posterior probabilities corresponding to LOOCV.

library(MASS)
bodylda <- lda(Gender~., data=body, CV=TRUE)
table(body$Gender, bodylda$class)
##    
##       0   1
##   0 257   3
##   1   6 241

So a misclassification rate of 9/507 ≈ 0.018.

Now for logloss…

LogLoss(bodylda$posterior[,2], as.numeric(body$Gender)-1)
## [1] 0.06311807

Comparing back to random forests, we can see improvement, providing an argument that LDA is the better long-run option (since these are cross-validated results).

Since the sample sizes are large enough, it’s natural to try removing the equal covariance matrix assumption of LDA in favour of QDA. Let’s see what happens!

bodyqda <- qda(Gender~., data=body, CV=TRUE)
table(body$Gender, bodyqda$class)
##    
##       0   1
##   0 256   4
##   1   6 241

Well, that’s not promising. One more misclassification!

LogLoss(bodyqda$posterior[,2], as.numeric(body$Gender)-1)
## [1] 0.1020329

The LogLoss is even more damning, as it approaches the LogLoss of the random forest model (which had considerably more misclassifications). This suggests that the misclassified observations may carry much more false confidence under QDA.

How about k nearest neighbours? Let’s just try for k=1 through k=300…

library(class)
knnruns <- list()
for(i in 1:300){
  knnruns[[i]] <- knn.cv(body[,-25], body$Gender, k=i, prob=TRUE)
}
misclass <- unlist(lapply(knnruns, function(v) 507-sum(diag(table(body$Gender, v)))))
plot(misclass)

which.min(misclass)
## [1] 3

These CV runs suggest that k=3 is best for long-run misclassifications. Now we can take the results from k=3 and check all of our other metrics…

knnbest <- knnruns[[3]]
table(body$Gender, knnbest)
##    knnbest
##       0   1
##   0 253   7
##   1   4 243

The probabilities returned from knn.cv are given only for the class that was predicted. This is fine for computing LogLoss in binary classification…it just means a little manual work. Basically, we sum up -log(prob) over the returned probabilities for all the correctly classified observations, and -log(1-prob) over the 11 misclassifications we see in the table, and then divide by n…

probs <- attr(knnbest, "prob")
(sum(-log(probs[body$Gender==knnbest])) + sum(-log(1-probs[body$Gender!=knnbest])))/507
## [1] Inf

Shoot…an Inf. This means we were 100% confident in at least one incorrect classification. It also means we need a little more work to get a finite LogLoss estimate. We will floor the relevant probabilities at 1e-15…

missedprobs <- 1-probs[body$Gender!=knnbest]
missedprobs[missedprobs==0] <- 1e-15
probs[probs==0] <- 1e-15
(sum(-log(probs[body$Gender==knnbest])) + sum(-log(missedprobs)))/507
## [1] 0.3052576

Well that looks pretty bad! But wait, we replaced 0’s with an extremely small value (1e-15), so note that

-log(1e-15)
## [1] 34.53878

is likely ballooning the average substantially. How much of an effect? Well, let’s change to 1e-3…

probs <- attr(knnbest, "prob")
missedprobs <- 1-probs[body$Gender!=knnbest]
missedprobs[missedprobs==0] <- 1e-3
probs[probs==0] <- 1e-3
(sum(-log(probs[body$Gender==knnbest])) + sum(-log(missedprobs)))/507
## [1] 0.08726142

Which is much more in line with some of the other models. Though perhaps models should be more heavily penalized when they say with certainty that there is NO CHANCE that an observation belongs to the CORRECT GROUP!

In any case, we can see that even though KNN results in the same number of misclassifications as LDA, it is overconfident in those incorrect classifications (and also perhaps underconfident in some of its correct choices).

All signs point to LDA as the best model on this data set. If you take a look at some scatterplots, it does seem that normality could possibly hold (approximately) for a lot of the predictor variables, and so in this sense it’s probably not all that surprising that LDA works very well.

Furthermore, the assumption of equal covariance matrices for physical measurements on both men and women is not really an outrageous assumption to make. An example of the ramifications of this assumption: on average, any increase in height will lead to an equivalent increase in weight, regardless of reported gender (even if there are different mean vectors across the reported genders).
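To make the equal-covariance assumption concrete, here's a hedged sketch (on hypothetical toy data, not the body measurements) of the pooled covariance matrix that LDA estimates; QDA would instead keep a separate covariance matrix per group.

```r
# LDA pools one covariance matrix across the groups:
# Sp = ((n1-1)*S1 + (n2-1)*S2) / (n1 + n2 - 2)
set.seed(1)
g1 <- matrix(rnorm(40), ncol = 2)   # toy "group 1" measurements
g2 <- matrix(rnorm(60), ncol = 2)   # toy "group 2" measurements
n1 <- nrow(g1); n2 <- nrow(g2)
Sp <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)
Sp  # the single covariance both groups are assumed to share
```

The shared Sp is what forces the "same slope" relationship between measurements across groups, even when the group means differ.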

Mixture Models

Keeping in mind that mixture model families can often take some time to fit, we will restrict the number of models considered during many of the model fits for brevity of the lab. I certainly encourage you to explore the software in more depth.

First, let’s run the popular mclust package on the faithful data set

#install.packages("mclust")
#install.packages("teigen")
library(mclust)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## The following object is masked from 'package:mvtnorm':
## 
##     dmvnorm
mfaith <- Mclust(faithful, G=1:5)

Rather than scrolling through a large table of values, you can view a plot of the BIC values. The x-axis is the number of groups in the clustering solution (called components of the mixture model), and the y-axis is the BIC (a model selection criterion). Each of the covariance structures we discussed in class gets its own line on the plot.

plot(mfaith, what="BIC")

The classification plot will provide a visual of each mixture component overlaid on top of the fitted groups.

plot(mfaith, what="classification")

The uncertainty plot gives you a visual of which observations have relatively unclear classifications. The larger the point appears, the more uncertain we are in its classification.

plot(mfaith, what="uncertainty")
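Under the hood, each observation's uncertainty is simply 1 minus its largest posterior membership probability; a toy base-R illustration (hypothetical posterior matrix, not mclust output):

```r
# Uncertainty = 1 - (largest posterior probability for that observation)
z <- rbind(c(0.95, 0.05),   # clearly component 1: low uncertainty
           c(0.55, 0.45))   # borderline: high uncertainty
1 - apply(z, 1, max)        # 0.05 and 0.45
```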

Finally, the density plot will provide the contours of the actual fitted mixture model.

plot(mfaith, what="density")

Other software is available for similar models under different distributional assumptions. Here’s an example using some of Jeff’s software, which uses the multivariate t distribution and eigendecomposition constraints on the covariance matrix.

library(teigen)
tfaith <- teigen(faithful, Gs=1:5, model="UUUU", scale=FALSE, verbose = FALSE)
plot(tfaith, what="uncertainty")

plot(tfaith, what="contour")

The faithful data set is a true unsupervised problem as we don’t have known labels. But we can take a look at some of our supervised data sets to investigate how these clustering methods perform with respect to known groups…

library(gclus)
data(wine)
twine <- teigen(wine[,-1], Gs=1:5, model="UCCU", scale=FALSE, verbose=FALSE)
plot(twine, xmarg=1, ymarg=7, what="contour")

plot(twine, xmarg=1, ymarg=10, what="uncertainty")

table(wine[,1], twine$classification)
##    
##      1  2  3
##   1  0  0 59
##   2 66  5  0
##   3  0 48  0
plot(twine$allbic, type="l")
points(twine$allbic)

And so we can see (for at least one covariance structure from tEIGEN) that the G=3 model is chosen on the wine data via the BIC, and the resulting clusters match the known grape varietals quite closely (5 misclassifications).

We can check the ARI using the function from mclust…

adjustedRandIndex(wine[,1], twine$classification)
## [1] 0.9188052
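For the curious, here's a hedged, hand-rolled sketch of what the ARI computes from the contingency table of the two labellings (illustrative only, not mclust's implementation):

```r
# Adjusted Rand Index from the contingency table of two labellings:
# (sum_ij C(n_ij,2) - E) / (0.5*(a + b) - E), with E = a*b / C(n,2),
# where a = sum_i C(n_i.,2) and b = sum_j C(n_.j,2)
ari <- function(x, y) {
  tab <- table(x, y)
  sij <- sum(choose(tab, 2))
  a <- sum(choose(rowSums(tab), 2))
  b <- sum(choose(colSums(tab), 2))
  e <- a * b / choose(sum(tab), 2)
  (sij - e) / ((a + b) / 2 - e)
}
ari(c(1, 1, 2, 2), c(1, 1, 2, 2))  # perfect agreement gives 1
ari(c(1, 1, 2, 2), c(2, 2, 1, 1))  # label switching still gives 1
```

Because it only depends on the contingency table, the ARI is unaffected by how the clusters happen to be numbered, which is exactly what we want when comparing clusterings to known groups.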